Food Delivery Time Prediction - Analysis¶

Given the number of methods and models, the code may take a few minutes to execute.
If you want to view the results without running it, go to the file Food_delivery_analisis.html.


Setup¶

As a first step, we import the necessary libraries and tools.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as stats

from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from scipy.stats import chi2_contingency

from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)

from functools import partial
from sklearn.model_selection import (cross_validate, KFold, ShuffleSplit, train_test_split, cross_val_score, GridSearchCV)
from sklearn.base import clone
from ISLP.models import sklearn_sm
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import confusion_matrix, mean_absolute_error, r2_score

The dataset Food_Delivery_Times is imported from the Git directory.
After examining the contents of the dataset, we observe the presence of null values and decide to remove the affected rows.
Null values are removed because they can interfere with model fitting, leading to inaccurate results.

In [2]:
Food = pd.read_csv("dataset/Food_Delivery_Times.csv")
print("Number of rows in the original dataset:", len(Food))

null_rows = Food.isnull().any(axis=1).sum()
print("Number of rows with missing values:", null_rows)

Food = Food.dropna()
print("Number of rows after removing missing values:", len(Food))

Food.head()
Number of rows in the original dataset: 1000
Number of rows with missing values: 117
Number of rows after removing missing values: 883
Out[2]:
Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type Preparation_Time_min Courier_Experience_yrs Delivery_Time_min
0 522 7.93 Windy Low Afternoon Scooter 12 1.0 43
1 738 16.42 Clear Medium Evening Bike 20 2.0 84
2 741 9.52 Foggy Low Night Scooter 28 1.0 59
3 661 7.44 Rainy Medium Afternoon Scooter 5 1.0 37
4 412 19.03 Clear Low Morning Bike 16 5.0 68

Outliers are extreme values that can distort analysis and lead to misleading results.
We used the Interquartile Range (IQR) method to detect and remove them.
This ensures cleaner data and more reliable outcomes in the following analyses.

In [3]:
numeric_columns = Food.select_dtypes(include=['float64', 'int64']).columns

Food_original = Food.copy()

def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

for col in numeric_columns:
    Food = remove_outliers(Food, col)

plt.figure(figsize=(18, 12))
for i, col in enumerate(numeric_columns, 1):

    plt.subplot(2, len(numeric_columns), i)
    sns.boxplot(x=Food_original[col])
    plt.title(f"{col} - Before")

    plt.subplot(2, len(numeric_columns), len(numeric_columns) + i)
    sns.boxplot(x=Food[col])
    plt.title(f"{col} - After")

plt.tight_layout()
plt.show()

print("Number of rows after removing outliers", len(Food))
Number of rows after removing outliers 879

The boxplots show the distribution of numerical variables before and after outlier removal.
The dataset shows good quality as it contains few outliers. Extreme points, especially in Delivery_Time_min, were removed, making the data more compact.
The cleaned data is now more consistent and less influenced by rare or abnormal values.


Simple Linear Regression¶

Before building regression models, we assess the quality of the dataset by analyzing the correlation between variables through the construction of a correlation matrix.

In [4]:
numeric_df = Food.select_dtypes(include=[np.number])

if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(10, 8))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()

In our case, some variables show a clear linear relationship with delivery time, indicating their potential importance in the regression model.
Additionally, the predictors are not highly correlated with each other, suggesting that the dataset is suitable for linear regression and that multicollinearity is not a concern.
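Although the heatmap suggests low collinearity, this can also be quantified with variance inflation factors (the `variance_inflation_factor` helper imported above). The sketch below uses synthetic columns with the same names as the dataset's numeric predictors, purely for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-ins for the numeric predictors (illustrative values only)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'intercept': np.ones(n),
    'Distance_km': rng.uniform(1, 20, n),
    'Preparation_Time_min': rng.integers(5, 30, n).astype(float),
    'Courier_Experience_yrs': rng.uniform(0, 10, n),
})

# VIF of each predictor against all the others (the intercept stays in the
# design matrix but is not reported, as is standard practice)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != 'intercept'}
for col, v in vifs.items():
    print(f"{col}: {v:.2f}")
```

VIF values near 1 confirm that each predictor carries largely independent information; values above 5-10 would warrant concern.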

The three numerical variables (Distance_km, Preparation_Time_min, Courier_Experience_yrs) are selected, and the relationship between each of them and the response variable Delivery_Time_min is analyzed individually.
The variable Order_ID is completely irrelevant to our objective, so we do not take it into account.

  • First case:¶

Delivery_Time_min = b0 + b1 * Distance_km + e¶

In [5]:
X = pd.DataFrame({'intercept': np.ones(Food.shape[0]), 'Distance_km': Food['Distance_km']})
X[:4]
Out[5]:
intercept Distance_km
0 1.0 7.93
1 1.0 16.42
2 1.0 9.52
3 1.0 7.44
In [6]:
y = Food['Delivery_Time_min']
model = sm.OLS(y, X) 
results = model.fit() 
summarize(results)
Out[6]:
coef std err t P>|t|
intercept 26.8673 0.890 30.184 0.0
Distance_km 2.9180 0.077 37.746 0.0

Distance_km has a p-value of 0.0, indicating that it is highly significant.
The coefficient is positive, meaning that as the distance increases, the waiting time also increases.
The intercept is high, suggesting that distance alone does not sufficiently explain the delivery times.

Model prediction and confidence intervals¶

We define a linear regression design matrix including an intercept and the explanatory variable Distance_km using the ModelSpec class from ISLP.
This allows for consistent and automatic construction of the design matrix for both the original dataset and new observations.

We then generate predictions and 95% confidence intervals for new values of Distance_km based on the previously fitted model.

In [7]:
model = MS(['Distance_km'])
model = model.fit(Food) 
X = model.transform(Food)
X[:4]
Out[7]:
intercept Distance_km
0 1.0 7.93
1 1.0 16.42
2 1.0 9.52
3 1.0 7.44
In [8]:
new_df = pd.DataFrame({'Distance_km':[10, 12, 14, 16, 20]})
newX = model.transform(new_df) 
newX
Out[8]:
intercept Distance_km
0 1.0 10
1 1.0 12
2 1.0 14
3 1.0 16
4 1.0 20
In [9]:
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
Out[9]:
array([56.047364  , 61.88336722, 67.71937045, 73.55537368, 85.22738013])
In [10]:
new_predictions.conf_int(alpha=0.05)
Out[10]:
array([[55.18622211, 56.90850588],
       [60.97123424, 62.79550021],
       [66.66742956, 68.77131134],
       [72.30423639, 74.80651096],
       [83.48514315, 86.96961711]])

These intervals suggest that the model provides reasonably tight estimates for the predicted waiting times, with relatively small ranges around each predicted value.
This indicates a good level of precision in the model's predictions.

Graphs¶

  1. Regression line¶

In [11]:
def abline(ax, b, m, *args, **kwargs):
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)

ax = Food.plot.scatter('Distance_km', 'Delivery_Time_min')
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=3)

The scatter plot shows a positive correlation between distance in kilometers and delivery time in minutes.
The data points align well with the linear trend line, indicating that the model is accurate and useful for predicting delivery times based on distance.

  2. Fitted values vs residuals¶

In [12]:
fig, ax = plt.subplots(figsize=(6, 6)) 
ax.scatter(results.fittedvalues, results.resid)  
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--')   
Out[12]:
<matplotlib.lines.Line2D at 0x249af84a850>

The residuals are spread around the zero line, indicating that the model fits the data well.
There are no clear patterns in the residuals, suggesting the model's assumptions about linearity are valid.

  3. Leverage¶

In [13]:
infl = results.get_influence()
ax = plt.subplots(figsize=(6, 6))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag)
Out[13]:
np.int64(354)

The leverage values are low, meaning no single data point has an outsized impact on the fit.
A common cutoff for high leverage here is about 0.0045 (roughly 2(p + 1)/n for this model); since almost all data points fall below this threshold, the model is well balanced.

  • Second case:¶

Delivery_Time_min = b0 + b1 * Preparation_Time_min + e¶

In [14]:
X = pd.DataFrame({'intercept': np.ones(Food.shape[0]), 'Preparation_Time_min': Food['Preparation_Time_min']})
X[:4]
Out[14]:
intercept Preparation_Time_min
0 1.0 12
1 1.0 20
2 1.0 28
3 1.0 5
In [15]:
y = Food['Delivery_Time_min']
model = sm.OLS(y, X) 
results = model.fit() 
summarize(results)
Out[15]:
coef std err t P>|t|
intercept 41.2948 1.727 23.913 0.0
Preparation_Time_min 0.8703 0.093 9.322 0.0

Preparation_Time_min has a p-value of 0.0, indicating that it is highly significant.
The coefficient is positive, meaning that as the preparation time increases, the waiting time also increases.
The intercept is relatively high, suggesting that preparation time alone does not fully explain the total waiting time.

Model prediction and confidence intervals¶

In [16]:
model = MS(['Preparation_Time_min'])
model = model.fit(Food) 
X = model.transform(Food)
X[:4]
Out[16]:
intercept Preparation_Time_min
0 1.0 12
1 1.0 20
2 1.0 28
3 1.0 5
In [17]:
new_df = pd.DataFrame({'Preparation_Time_min':[10, 12, 14, 16, 20]})
newX = model.transform(new_df)
newX
Out[17]:
intercept Preparation_Time_min
0 1.0 10
1 1.0 12
2 1.0 14
3 1.0 16
4 1.0 20
In [18]:
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
Out[18]:
array([49.99800226, 51.73863378, 53.47926529, 55.21989681, 58.70115983])
In [19]:
new_predictions.conf_int(alpha=0.05)
Out[19]:
array([[48.14830241, 51.84770211],
       [50.12183792, 53.35542964],
       [52.03868239, 54.91984819],
       [53.87634173, 56.56345188],
       [57.26216877, 60.1401509 ]])

These intervals are wider than those of the distance-based model, indicating slightly less precise predictions of the waiting times.
Nevertheless, the model still achieves a reasonable degree of precision in its estimates.

Graphs¶

  1. Regression line¶

In [20]:
def abline(ax, b, m, *args, **kwargs):
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)

ax = Food.plot.scatter('Preparation_Time_min', 'Delivery_Time_min')
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=3)

The scatter plot shows a positive correlation between preparation time and delivery time in minutes.
The data points generally align with the linear trend line, suggesting that the model is reasonably good.
However, compared to the previous model, the dispersion of the data points appears to increase.
Additionally, there is noticeable vertical overlap between points at similar preparation times, indicating that factors beyond preparation time also influence delivery times.

  2. Fitted values vs residuals¶

In [21]:
fig, ax = plt.subplots(figsize=(6, 6)) 
ax.scatter(results.fittedvalues, results.resid)  
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');  

The residuals appear to be fairly symmetrically distributed above and below the zero line, which supports the assumption of linearity in the model.
However, the residuals are widely spread, showing a noticeable vertical dispersion.

  3. Leverage¶

In [22]:
infl = results.get_influence()
ax = plt.subplots(figsize=(6, 6))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag);

Leverage values across the dataset are low, indicating that no individual observation exerts excessive influence on the model's predictions.
The threshold for high leverage is 0.0045, and all points fall below this value.
From this perspective, the model is well behaved.

  • Third case:¶

Delivery_Time_min = b0 + b1 * Courier_Experience_yrs + e¶

In [23]:
X = pd.DataFrame({'intercept': np.ones(Food.shape[0]), 'Courier_Experience_yrs': Food['Courier_Experience_yrs']})
X[:4]
Out[23]:
intercept Courier_Experience_yrs
0 1.0 1.0
1 1.0 2.0
2 1.0 1.0
3 1.0 1.0
In [24]:
y = Food['Delivery_Time_min']
model = sm.OLS(y, X) 
results = model.fit() 
summarize(results)
Out[24]:
coef std err t P>|t|
intercept 58.7422 1.328 44.222 0.000
Courier_Experience_yrs -0.5695 0.242 -2.352 0.019

Courier_Experience_yrs has a p-value of 0.019, indicating that it is significant at the 5% level.
The coefficient is negative, meaning that as courier experience increases, the waiting time tends to decrease.
The intercept is relatively high, suggesting that courier experience alone does not fully account for the variability in waiting times.
Compared to the previous models, Courier_Experience_yrs has a smaller effect size and slightly higher uncertainty, as reflected in the standard error and t-statistic.

Model prediction and confidence intervals¶

In [25]:
model = MS(['Courier_Experience_yrs'])
model = model.fit(Food)
X = model.transform(Food)
X[:4]
Out[25]:
intercept Courier_Experience_yrs
0 1.0 1.0
1 1.0 2.0
2 1.0 1.0
3 1.0 1.0
In [26]:
new_df = pd.DataFrame({'Courier_Experience_yrs':[1, 3, 5, 7, 10]})
newX = model.transform(new_df)
newX
Out[26]:
intercept Courier_Experience_yrs
0 1.0 1
1 1.0 3
2 1.0 5
3 1.0 7
4 1.0 10
In [27]:
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
Out[27]:
array([58.1727115 , 57.03372218, 55.89473287, 54.75574356, 53.04725959])
In [28]:
new_predictions.conf_int(alpha=0.05)
Out[28]:
array([[55.95314347, 60.39227952],
       [55.43959247, 58.62785189],
       [54.49350217, 57.29596357],
       [52.96882553, 56.54266159],
       [50.14442453, 55.95009465]])

These intervals are wider relative to the size of the effect than in the previous models, indicating less precise predictions of the waiting times.
This is consistent with the weak relationship between courier experience and delivery time.

Graphs¶

  1. Regression line¶

In [29]:
def abline(ax, b, m, *args, **kwargs):
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)

ax = Food.plot.scatter('Courier_Experience_yrs', 'Delivery_Time_min')
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=3)

The plot shows a slightly downward trend line, indicating a weak negative relationship between courier experience and delivery time.
The data points are highly dispersed around the trend line, suggesting that the model does not adequately explain the variability in the data.
Additionally, there is considerable vertical overlap between the points, which indicates that factors beyond experience may significantly influence delivery times.
This suggests the possibility of a nonlinear relationship, and the model is likely the worst one so far.

  2. Fitted values vs residuals¶

In [30]:
fig, ax = plt.subplots(figsize=(6, 6)) 
ax.scatter(results.fittedvalues, results.resid)  
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--');

In this case as well, the residuals are symmetric around zero but large in magnitude, indicating poor predictive accuracy.

  3. Leverage¶

In [31]:
infl = results.get_influence()
ax = plt.subplots(figsize=(6, 6))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag);

No point has high leverage, meaning no single point dominates the model's estimation.
This is positive, but it also confirms that no single observation provides strong information to the model, which is consistent with the fact that the model does not adequately explain the variability in the data.


Multiple Linear Regression¶

  • Numerical predictors¶

Delivery_Time_min = b0 + b1 * Distance_km + b2 * Preparation_Time_min + b3 * Courier_Experience_yrs + e¶

In [32]:
X = MS(['Distance_km', 'Preparation_Time_min', 'Courier_Experience_yrs']).fit_transform(Food)
y = Food['Delivery_Time_min']

model = sm.OLS(y, X)
results = model.fit()
summarize(results)
Out[32]:
coef std err t P>|t|
intercept 13.0071 1.320 9.854 0.0
Distance_km 2.9493 0.066 45.000 0.0
Preparation_Time_min 0.9273 0.051 18.096 0.0
Courier_Experience_yrs -0.4803 0.127 -3.777 0.0

The results of the regression model indicate that distance, preparation time, and courier experience all significantly influence food delivery time (p-values ≈ 0).
Specifically, delivery time increases by approximately 2.95 minutes per additional kilometer.
It also increases by about 0.93 minutes for each additional minute of preparation.
In contrast, each additional year of courier experience reduces delivery time by about 0.48 minutes.

In [33]:
print("R2", results.rsquared) 
print("RSE", np.sqrt(results.scale))
R2 0.7269698515442016
RSE 11.024219845632128

The linear regression model explains approximately 73% of the variance in food delivery time, indicating a strong fit.
The residual standard error (RSE) is 11 minutes, suggesting a moderate prediction error given the likely range of delivery times.
Overall, the model performs well.

  • Qualitative predictors¶

Now we examine the remaining variables, which are categorical.

In [34]:
for col in Food.columns:
    if Food[col].dtype == 'object':
        print(f"{col}: {Food[col].unique()}")
Weather: ['Windy' 'Clear' 'Foggy' 'Rainy' 'Snowy']
Traffic_Level: ['Low' 'Medium' 'High']
Time_of_Day: ['Afternoon' 'Evening' 'Night' 'Morning']
Vehicle_Type: ['Scooter' 'Bike' 'Car']

Before creating the new model, we verify that the categorical variables are not strongly correlated with each other, ensuring that each one contributes useful information to the model.
Since the variables are not numerical, we cannot use a standard correlation matrix.
Instead, we use Cramér's V, a statistical measure of association between two categorical variables, which ranges from 0 (no association) to 1 (perfect association).

In [35]:
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1)) 
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

cramer_matrix = pd.DataFrame(
    np.zeros((len(categorical), len(categorical))),
    index=categorical,
    columns=categorical
)

for var1 in categorical:
    for var2 in categorical:
        cramer_matrix.loc[var1, var2] = cramers_v(Food[var1], Food[var2])

plt.figure(figsize=(10, 8))
sns.heatmap(cramer_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title("Cramér's V of Categorical Features")
plt.show()

The Cramér’s V matrix shows very low values, indicating that the categorical variables are essentially independent from each other.
No significant associations emerge, suggesting that each variable provides distinct and non-redundant information to the model.

Using the tools from ISLP, we build a regression model that includes all variables, both numerical and categorical.
Categorical variables are automatically handled by the library through one-hot encoding, a process that transforms each category into a separate binary variable.
This allows the model to interpret categorical information in a numerical form suitable for linear regression.
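As a rough illustration of what this encoding produces, `pd.get_dummies` with `drop_first=True` mimics the baseline-category behaviour of ModelSpec (the dropped level is absorbed into the intercept); the rows below are a small hand-made sample, not taken from the dataset:

```python
import pandas as pd

# Hand-made sample with the same category labels as the dataset
df = pd.DataFrame({'Weather': ['Clear', 'Rainy', 'Snowy', 'Clear'],
                   'Traffic_Level': ['Low', 'High', 'Medium', 'High']})

# drop_first=True drops one level per variable (Clear, High), which becomes
# the baseline absorbed into the intercept -- as in the fitted model above
dummies = pd.get_dummies(df, drop_first=True, dtype=float)
print(dummies.columns.tolist())
```

Each remaining column is a 0/1 indicator, and its coefficient is interpreted as the shift relative to the dropped baseline category.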

In [36]:
model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Time_of_Day',
    'Vehicle_Type',
    'Preparation_Time_min',
    'Courier_Experience_yrs'
])

X = model.fit_transform(Food)
y = Food['Delivery_Time_min']

model = sm.OLS(y, X)
results = model.fit()
summarize(results)
Out[36]:
coef std err t P>|t|
intercept 18.4314 1.467 12.564 0.000
Distance_km 2.9055 0.056 51.622 0.000
Weather[Foggy] 8.0616 1.059 7.611 0.000
Weather[Rainy] 4.9402 0.832 5.938 0.000
Weather[Snowy] 10.2738 1.126 9.128 0.000
Weather[Windy] 1.9968 1.129 1.769 0.077
Traffic_Level[Low] -12.2613 0.869 -14.102 0.000
Traffic_Level[Medium] -6.0433 0.865 -6.984 0.000
Time_of_Day[Evening] 0.0829 0.826 0.100 0.920
Time_of_Day[Morning] -0.8775 0.817 -1.074 0.283
Time_of_Day[Night] -1.4297 1.238 -1.155 0.248
Vehicle_Type[Car] -0.3326 0.854 -0.389 0.697
Vehicle_Type[Scooter] -1.3290 0.739 -1.799 0.072
Preparation_Time_min 0.9447 0.044 21.507 0.000
Courier_Experience_yrs -0.5556 0.109 -5.081 0.000

From the regression results, several observations can be made:

  • Distance_km has a strong and significant positive effect on delivery time.
  • Adverse weather conditions like Foggy, Rainy, and Snowy significantly increase delivery time, compared to Clear weather.
  • Traffic_Level[Low] and Medium significantly reduce delivery time compared to High traffic, which is the baseline.
  • Different times of the day (Morning, Evening, Night) do not show statistically significant differences (p-values > 0.05) from the baseline (Afternoon).
  • Similarly, neither Car nor Scooter types show significant differences (p-values > 0.05) from the baseline vehicle (Bike), suggesting that vehicle type may not strongly influence delivery time in this model.
  • Preparation_Time_min significantly increases delivery time, as expected.
  • Interestingly, Courier_Experience_yrs is negatively associated with delivery time, indicating that more experienced couriers tend to deliver faster.
In [37]:
print("R2", results.rsquared) 
print("RSE", np.sqrt(results.scale))
R2 0.803281642221906
RSE 9.416991168829025

The linear regression model now explains approximately 80% of the variance in food delivery time, indicating an even stronger fit compared to the previous model (73%).
The residual standard error is now about 9.4 minutes, a slight improvement in prediction accuracy compared to the previous model (11 minutes).
Overall, the model has improved in both explanatory power and prediction accuracy.

Observing the results, we notice that the p-values for Vehicle_Type and Time_of_Day are high.
Before removing them, we aim to better understand why vehicle type does not affect delivery time.
In real scenarios, using a car should reduce delivery time compared to using a bike.

In [38]:
print(Food['Vehicle_Type'].value_counts())
Vehicle_Type
Bike       450
Scooter    259
Car        170
Name: count, dtype: int64
In [39]:
print(Food.groupby('Vehicle_Type', observed=False)['Delivery_Time_min'].mean())
Vehicle_Type
Bike       56.473333
Car        56.841176
Scooter    54.965251
Name: Delivery_Time_min, dtype: float64
In [40]:
print(Food.groupby('Vehicle_Type', observed=False)['Distance_km'].mean())
Vehicle_Type
Bike       10.001067
Car        10.195471
Scooter     9.931197
Name: Distance_km, dtype: float64

We obtained the following results:

  • As previously mentioned, the Cramér's V matrix shows that the association between Vehicle_Type and the other categorical variables is nearly nonexistent (maximum value 0.05).
  • The distribution of Vehicle_Type is unbalanced (450 deliveries by bike, 259 by scooter, and 170 by car); despite this, each group is large enough to estimate means robustly.
  • The analysis of average delivery times per vehicle type revealed similar values: 56.47 minutes for bike, 54.96 for scooter, and 56.84 for car.
  • Similarly, the average distance traveled by each vehicle type was nearly the same (around 10 km), suggesting that vehicle type is not associated with relevant differences in either distance or average delivery time.

These results account for the high p-values; the absence of an effect may indicate that logistical differences between vehicles are offset by operational factors not captured in the dataset (e.g., city routes, assigned paths, traffic).

We now return to the model and remove, as previously anticipated, Vehicle_Type and Time_of_Day.

In [41]:
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')

model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])

X = model.fit_transform(Food)
y = Food['Delivery_Time_min']


model = sm.OLS(y, X)
results = model.fit()
summarize(results)
Out[41]:
coef std err t P>|t|
intercept 17.5478 1.348 13.019 0.000
Distance_km 2.9057 0.056 51.668 0.000
Weather[Foggy] 8.0511 1.059 7.602 0.000
Weather[Rainy] 4.9947 0.831 6.011 0.000
Weather[Snowy] 10.4361 1.120 9.318 0.000
Weather[Windy] 2.0469 1.126 1.818 0.069
Traffic_Level[Low] -12.2128 0.869 -14.056 0.000
Traffic_Level[Medium] -6.0979 0.865 -7.049 0.000
Preparation_Time_min 0.9446 0.044 21.509 0.000
Courier_Experience_yrs -0.5510 0.109 -5.040 0.000
In [42]:
print("R2", results.rsquared) 
print("RSE", np.sqrt(results.scale))
R2 0.8019562697725547
RSE 9.421439203442182

By removing the two variables, we obtained a simpler and more interpretable model.
This was achieved without losing the good performance of the previous full model: R² and RSE values remain nearly identical to before.

In [43]:
y_pred = results.fittedvalues

plt.figure(figsize=(6, 4))
plt.scatter(y, y_pred, alpha=0.6, color='royalblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--')
plt.xlabel('Observed Values')
plt.ylabel('Predicted Values')
plt.title('Observed Values vs Predicted Values')
plt.grid(True)
plt.show()

residuals = results.resid

plt.figure(figsize=(6, 4))
plt.scatter(y_pred, residuals, alpha=0.6, color='darkorange')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.grid(True)
plt.show()

sns.histplot(residuals, kde=True, stat="density", color="skyblue", bins=30)
plt.title("Residual Distribution")
plt.xlabel("Residual")
plt.ylabel("Density")
plt.show()

Observing the graphs, we noticed the presence of outliers.
To improve the robustness of the regression model, we removed outliers based on the distribution of residuals from an initial OLS fit.
Residuals are assumed to follow a normal distribution (as confirmed by the previous plot above), and a threshold based on the standard deviation is applied to identify anomalous observations.
Instead of the conventional ±3 standard deviations, we adopted a slightly more restrictive cutoff of ±2.9 standard deviations.
This choice allowed us to remove all significant outliers without discarding a substantial portion of the data.

In [44]:
residuals = results.resid

residuals_std = np.std(residuals)
threshold = 2.9 * residuals_std

mask = np.abs(residuals) < threshold
Food = Food.loc[mask].copy()

print(f"Number of rows after removing outliers: {len(Food)}")
Number of rows after removing outliers: 860
In [45]:
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')

model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])

X = model.fit_transform(Food)
y = Food['Delivery_Time_min']

model = sm.OLS(y, X)
results = model.fit()
summarize(results)
Out[45]:
coef std err t P>|t|
intercept 14.5238 0.966 15.037 0.0
Distance_km 2.9635 0.040 74.029 0.0
Weather[Foggy] 6.3084 0.767 8.225 0.0
Weather[Rainy] 4.7894 0.592 8.095 0.0
Weather[Snowy] 9.9081 0.799 12.403 0.0
Weather[Windy] 2.9583 0.795 3.721 0.0
Traffic_Level[Low] -11.2013 0.621 -18.025 0.0
Traffic_Level[Medium] -5.4008 0.619 -8.725 0.0
Preparation_Time_min 0.9985 0.031 31.932 0.0
Courier_Experience_yrs -0.5361 0.078 -6.883 0.0
In [46]:
print("R2", results.rsquared) 
print("RSE", np.sqrt(results.scale))
R2 0.894431839175429
RSE 6.643357633887748

The R² value has improved by about 9 percentage points, rising from roughly 0.80 to nearly 0.90.
The error in delivery time is now approximately 6-7 minutes.
Additionally, only a few rows were removed, with the dataset decreasing from 879 to 860.

In [47]:
plt.figure(figsize=(6, 4))
plt.scatter(y, results.fittedvalues, alpha=0.5, color='royalblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.xlabel('Observed Values')
plt.ylabel('Predicted Values')
plt.title('Observed vs Predicted Values (after cleaning)')
plt.show()

plt.figure(figsize=(6, 4))
plt.scatter(results.fittedvalues, results.resid, alpha=0.5, color='darkorange')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values (after cleaning)')
plt.show()

However, the Residuals vs Predicted Values plot shows a clear cone-shaped pattern, with the spread of the residuals widening to roughly ±20 as the predicted values increase.
This pattern indicates heteroscedasticity, meaning the variance of the residuals increases as the predicted values grow.
In other words, the model's error is not constant across all levels of prediction, but rather increases as the predicted values move further from 0.
A solution to this phenomenon is to apply logarithmic or square root transformations to the model.
These transformations stabilize the variance by compressing the scale of larger predicted values, making the residuals more homoscedastic and improving the model's assumptions.

In [48]:
def transform_y(y, method):
    if method == 'log':
        return np.log1p(y)  
    elif method == 'sqrt':
        return np.sqrt(y)
    else:
        raise ValueError(f"Unknown transformation: {method}")

transformations = ['log', 'sqrt']

results_summary = {}

categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')

model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])

X = model.fit_transform(Food)
y = Food['Delivery_Time_min']

for transform in transformations:
    print(f"\nTransformation: {transform}")

    y_transformed = transform_y(y, transform)
    
    mod = sm.OLS(y_transformed, X)
    res = mod.fit()

    results_summary[transform] = res.rsquared

    y_pred = res.predict(X)

    if transform == 'log':
        y_pred_plot = np.expm1(y_pred)
    elif transform == 'sqrt':
        y_pred_plot = np.square(y_pred)
    else:
        y_pred_plot = y_pred

    plt.figure(figsize=(6, 4))
    plt.scatter(y, y_pred_plot, alpha=0.5, color='royalblue')
    plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--')
    plt.xlabel('Observed Values')
    plt.ylabel('Predicted Values')
    plt.title(f'Observed vs Predicted - {transform}')
    plt.grid(True)
    plt.show()

    residuals = y_transformed - y_pred
    plt.figure(figsize=(6, 4))
    plt.scatter(y_pred, residuals, color='darkorange', alpha=0.5)
    plt.axhline(y=0, color='k', linestyle='--')
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    plt.title(f'Residuals vs Predicted - {transform}')
    plt.grid(True)
    plt.show()

print("\nR² for each transformation:")
for key, value in results_summary.items():
    print(f"{key}: {value:.4f}")
Transformation: log
[Figure: Observed vs Predicted - log]
[Figure: Residuals vs Predicted - log]
Transformation: sqrt
[Figure: Observed vs Predicted - sqrt]
[Figure: Residuals vs Predicted - sqrt]
R² for each transformation:
log: 0.8815
sqrt: 0.9027

Observing the results from the plots, no clear improvement is noticeable.
Moreover, the R² values are similar to the original model.
Therefore, we prefer to continue our work with the non-transformed model.


Cross-Validation¶

Cross-validation helps evaluate how well the model generalizes to unseen data.
By testing the model on a separate validation set, we can detect overfitting and assess real-world performance.
In this case, we split the dataset into a training set Food_train and a validation set Food_valid.
We explicitly define the size of the validation set (430 samples), which is exactly half of the total dataset, to ensure a consistent and controlled evaluation.

In [49]:
Food_train , Food_valid = train_test_split(Food, test_size=430, random_state=0) 
In [50]:
categorical = ['Weather', 'Traffic_Level']
Food_train[categorical] = Food_train[categorical].astype('category')

cv_model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
])

X_train = cv_model.fit_transform(Food_train)
y_train = Food_train['Delivery_Time_min']


model = sm.OLS(y_train, X_train)
results = model.fit()
In [51]:
X_valid = cv_model.fit_transform(Food_valid)
y_valid = Food_valid['Delivery_Time_min']

valid_pred = results.predict(X_valid)
MSE = np.mean((y_valid - valid_pred)**2)
RMSE = np.sqrt(MSE)

print("MSE", MSE) 
print("RMSE", RMSE)
MSE 48.8813195529826
RMSE 6.991517685952214
  • Leave-one-out cross-validation¶

In [52]:
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')

loocv_model = sklearn_sm(sm.OLS, MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
]))

X = Food.drop(columns=['Delivery_Time_min'])
y = Food['Delivery_Time_min']

cv_results = cross_validate(loocv_model, X, y, cv=Food.shape[0])
MSE = np.mean(cv_results['test_score'])
RMSE = np.sqrt(MSE)

print("MSE", MSE) 
print("RMSE", RMSE)
MSE 44.70860072334475
RMSE 6.686449036921223

Leave-one-out cross-validation is the most precise of these approaches, but also the most expensive in terms of both time and computation.
This is because the model must be refit n times, each iteration leaving out a single observation for validation.
Indeed, the MSE and RMSE obtained are slightly better than with the previous validation-set approach, but the run took approximately 30 seconds, compared to a near-instantaneous result for the previous method.
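Passing `cv` equal to the number of rows, as done above, is equivalent to scikit-learn's dedicated `LeaveOneOut` splitter. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=40)

model = LinearRegression()

# cv=n produces n folds of one observation each, in index order --
# exactly the splits generated by LeaveOneOut()
scores_n = cross_val_score(model, X, y, cv=len(y), scoring='neg_mean_squared_error')
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring='neg_mean_squared_error')

print(np.allclose(scores_n, scores_loo))  # True
```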

  • K-fold cross-validation¶

In [53]:
categorical = ['Weather', 'Traffic_Level']
Food[categorical] = Food[categorical].astype('category')

kf_model = sklearn_sm(sm.OLS, MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Preparation_Time_min',
    'Courier_Experience_yrs',
]))

X = Food.drop(columns=['Delivery_Time_min'])
y = Food['Delivery_Time_min']

kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_results = cross_validate(kf_model, X, y, cv=kf)
MSE = np.mean(cv_results['test_score'])
RMSE = np.sqrt(MSE)

print("MSE", MSE) 
print("RMSE", RMSE)
MSE 45.05819307596635
RMSE 6.712539986917497

K-fold cross-validation strikes a balance between precision and computational cost.
It involves splitting the dataset into k subsets, and iteratively training and validating the model on different combinations of these subsets.
The results obtained from k-fold are positioned between those from the 50%-data cross-validation (which yielded slightly worse results) and the leave-one-out, which provided slightly better results.
However, the execution time is much closer to that of the initial cross-validation, being far less computationally expensive than LOOCV.


Shrinkage Methods¶

We continue the analysis of the model using shrinkage methods.
These methods shrink the coefficients of less relevant variables toward zero (and, in the case of the lasso, can set them exactly to zero).
To exploit this, it makes sense to return to a model with all variables.
This means adding back the categorical variables Time_of_Day and Vehicle_Type.
It is also important to standardize the predictors: shrinkage penalties are scale-sensitive, and standardization ensures all variables are on the same scale.

In [54]:
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')

model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Time_of_Day',
    'Vehicle_Type',
    'Preparation_Time_min',
    'Courier_Experience_yrs'
])

X = model.fit_transform(Food)
y = Food['Delivery_Time_min']

numerical_columns = ['Distance_km', 'Preparation_Time_min', 'Courier_Experience_yrs']

scaler = StandardScaler()
X[numerical_columns] = scaler.fit_transform(X[numerical_columns])

model = sm.OLS(y, X)
results = model.fit()

summarize(results)
Out[54]:
coef std err t P>|t|
intercept 59.5537 0.701 84.955 0.000
Distance_km 16.8426 0.228 73.852 0.000
Weather[Foggy] 6.3464 0.768 8.264 0.000
Weather[Rainy] 4.7687 0.593 8.044 0.000
Weather[Snowy] 9.8532 0.804 12.262 0.000
Weather[Windy] 2.8989 0.798 3.633 0.000
Traffic_Level[Low] -11.2475 0.623 -18.066 0.000
Traffic_Level[Medium] -5.3621 0.620 -8.653 0.000
Time_of_Day[Evening] -0.3425 0.591 -0.580 0.562
Time_of_Day[Morning] -0.5390 0.581 -0.927 0.354
Time_of_Day[Night] -1.4026 0.885 -1.585 0.113
Vehicle_Type[Car] -0.2829 0.610 -0.464 0.643
Vehicle_Type[Scooter] -0.7231 0.526 -1.374 0.170
Preparation_Time_min 7.2708 0.228 31.929 0.000
Courier_Experience_yrs -1.5712 0.228 -6.886 0.000
  • Ridge regression¶

We begin by splitting the database into train and test sets.
Next, we search for the optimal lambda for ridge regression.

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
lambda_values = np.logspace(-2, 4, 300) 
In [56]:
pipeline = make_pipeline(StandardScaler(), Ridge())

param_grid = {'ridge__alpha': lambda_values}
ridge_cv = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=10)  # 10-fold cross-validation
ridge_cv.fit(X_train, y_train)
best_alpha = ridge_cv.best_params_['ridge__alpha']
print(f"Best Ridge lambda: {best_alpha}")
Best Ridge lambda: 1.7680381784572086

We evaluate the test set using the lambda found.

In [57]:
ridge_best = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha))
ridge_best.fit(X_train, y_train)
ridge_pred = ridge_best.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)

coefficients = ridge_best.named_steps['ridge'].coef_

coeff_df = pd.DataFrame({'Variable': X_train.columns, 'Coefficient': coefficients})

print(coeff_df)
                  Variable  Coefficient
0                intercept     0.000000
1              Distance_km    16.921252
2           Weather[Foggy]     2.048377
3           Weather[Rainy]     1.621683
4           Weather[Snowy]     3.064244
5           Weather[Windy]     1.014522
6       Traffic_Level[Low]    -5.150850
7    Traffic_Level[Medium]    -2.279150
8     Time_of_Day[Evening]     0.109579
9     Time_of_Day[Morning]    -0.010385
10      Time_of_Day[Night]     0.175841
11       Vehicle_Type[Car]    -0.082883
12   Vehicle_Type[Scooter]    -0.197417
13    Preparation_Time_min     7.407239
14  Courier_Experience_yrs    -1.444747

From the results, we see that the less significant variables (as also observed in the previous models and methods) have coefficients very close to zero under ridge regression.
This occurs because ridge regression shrinks the coefficients of less relevant variables toward zero, improving model generalization.
We also notice that the intercept column's coefficient is exactly zero: after standardization the constant column carries no information, and the pipeline's own intercept term absorbs it.

In [58]:
MSE = ridge_mse
RMSE = np.sqrt(MSE)

print("MSE", MSE) 
print("RMSE", RMSE)
MSE 41.90806681166175
RMSE 6.473644013356137

The results in this case are similar to those obtained previously with the three cross-validation approaches (validation set, LOOCV, and k-fold), with a slight improvement in RMSE.

In [59]:
coefs = []
feature_names = X_train.columns 

for a in lambda_values:
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=a))
    pipe.fit(X_train, y_train)
    coefs.append(pipe.named_steps['ridge'].coef_)

coefs = np.array(coefs)  

plt.figure(figsize=(14, 8))
ax = plt.gca()

colors = plt.cm.tab20(np.linspace(0, 1, len(feature_names)))

for idx, name in enumerate(feature_names):
    ax.plot(lambda_values, coefs[:, idx], label=name, color=colors[idx])

ax.set_xscale('log')  
plt.xlabel('Lambda')
plt.axvline(best_alpha, color='r', linestyle='--')  
plt.ylabel('Coefficients')
plt.title('Ridge Coefficients as Function of Regularization')

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small')

plt.tight_layout()
plt.ylim(-10, 20)
plt.xlim(1e-1, 1e4)  

plt.show()
[Figure: Ridge Coefficients as Function of Regularization]

The results of the coefficients can also be seen in this graph, at the point corresponding to the red dashed line, which indicates the optimal lambda.
It is noticeable that the important variables are those farthest from zero.
It is also noticeable that ridge regression drives the coefficients toward zero as lambda increases.
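This shrinkage behaviour can be verified directly: the L2 norm of the ridge coefficient vector decreases as the penalty grows. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(size=200)

norms = []
for alpha in [0.1, 10.0, 1000.0]:
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    pipe.fit(X, y)
    # The L2 norm of the coefficient vector shrinks as alpha increases
    norms.append(np.linalg.norm(pipe.named_steps['ridge'].coef_))

print(norms)  # strictly decreasing
```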

In [60]:
mean_test_scores = -ridge_cv.cv_results_['mean_test_score'] 
alphas = ridge_cv.param_grid['ridge__alpha']

plt.figure(figsize=(10, 6))
plt.semilogx(alphas, mean_test_scores)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.xlabel('Lambda')
plt.ylabel('Mean Squared Error')
plt.title('Ridge Cross-Validation Error')
plt.ylim(48.63, 48.65)
plt.xlim(1e-1, 1e2)
plt.show()
[Figure: Ridge Cross-Validation Error]

In this other plot, we can see that the minimum MSE is found at the optimal lambda.

  • Lasso regression¶

We will now proceed in the same manner with lasso regression.

In [61]:
lasso_pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))

lasso_cv = GridSearchCV(lasso_pipe, 
                       {'lasso__alpha': lambda_values},
                       scoring='neg_mean_squared_error',
                       cv=10)
lasso_cv.fit(X_train, y_train)
best_alpha = lasso_cv.best_params_['lasso__alpha']
print(f"Best Lasso lambda: {best_alpha}")
Best Lasso lambda: 0.10553861030365236
In [62]:
lasso_best = lasso_cv.best_estimator_
lasso_pred = lasso_best.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)

print(f"Number of initial coefficients: {len(feature_names)}")

final_coefs = lasso_best.named_steps['lasso'].coef_
print(f"Number of non-zero coefficients: {np.sum(final_coefs != 0)}")

coeff_df = pd.DataFrame({'Variable': feature_names, 'Coefficient': final_coefs})
print(coeff_df)
Number of initial coefficients: 15
Number of non-zero coefficients: 11
                  Variable  Coefficient
0                intercept     0.000000
1              Distance_km    16.894242
2           Weather[Foggy]     1.879118
3           Weather[Rainy]     1.423975
4           Weather[Snowy]     2.893828
5           Weather[Windy]     0.823138
6       Traffic_Level[Low]    -4.905907
7    Traffic_Level[Medium]    -2.025254
8     Time_of_Day[Evening]     0.000000
9     Time_of_Day[Morning]    -0.000000
10      Time_of_Day[Night]     0.048237
11       Vehicle_Type[Car]    -0.000000
12   Vehicle_Type[Scooter]    -0.084120
13    Preparation_Time_min     7.326051
14  Courier_Experience_yrs    -1.307976

With lasso regression, the less significant variables are driven exactly to zero, allowing for model simplification.
Similarly, the intercept remains zero in this case as well.
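The variable-selection property is easy to see on synthetic data: given an irrelevant feature, a sufficiently large lasso penalty sets its coefficient exactly to zero, while ridge only shrinks it. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# y depends only on the first two columns; the third is pure noise
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(lasso.coef_)  # third coefficient exactly 0.0
print(ridge.coef_)  # third coefficient small but non-zero
```

Coordinate descent in the lasso soft-thresholds each coefficient, which is what produces exact zeros rather than merely small values.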

In [63]:
MSE = lasso_mse
RMSE = np.sqrt(MSE)

print("MSE", MSE) 
print("RMSE", RMSE)
MSE 42.21212509283608
RMSE 6.497085892370215

The results compared to ridge regression are practically identical.

In [64]:
coefs = []

for a in lambda_values:
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=a, max_iter=10000))
    pipe.fit(X_train, y_train)
    coefs.append(pipe.named_steps['lasso'].coef_)

coefs = np.array(coefs)  

plt.figure(figsize=(14, 8))
ax = plt.gca()

colors = plt.cm.tab20(np.linspace(0, 1, len(feature_names)))

for idx, name in enumerate(feature_names):
    ax.plot(lambda_values, coefs[:, idx], label=name, color=colors[idx])

ax.set_xscale('log')
plt.xlabel('Lambda')
plt.ylabel('Coefficients')
plt.title('Lasso Coefficients as Function of Regularization')
plt.axvline(best_alpha, color='r', linestyle='--') 
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small')  
plt.xlim(1e-2, 1e2)
plt.tight_layout()
plt.show()
[Figure: Lasso Coefficients as Function of Regularization]

Unlike the corresponding plot for ridge regression, we notice that in lasso regression the coefficients reach exactly zero as lambda increases past certain thresholds.

In [65]:
mean_test_scores = -lasso_cv.cv_results_['mean_test_score'] 
alphas = lasso_cv.param_grid['lasso__alpha']

plt.figure(figsize=(10, 6))
plt.semilogx(alphas, mean_test_scores)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.xlabel('Lambda')
plt.ylabel('Mean Squared Error')
plt.title('Lasso Cross-Validation Error')
plt.ylim(48, 50)
plt.xlim(1e-2, 1e1)
plt.show()
[Figure: Lasso Cross-Validation Error]

As with the optimal lambda for ridge regression, the optimal lambda for lasso regression is also at the minimum of the MSE function.


Regression Trees¶

Regression trees are decision tree models that split the data into regions based on the input variables, predicting the output in each region while minimizing the within-region variability.
We return to a non-normalized model, as normalization is not needed: tree-based methods are not sensitive to the scale of the variables.
As with previous methods, we split the data into train and test sets and use a DecisionTreeRegressor.
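The scale-invariance claim is easy to check: rescaling a feature by a positive constant changes the split thresholds but not the partitions the tree learns, so the predictions are unchanged. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2 * X[:, 0] + rng.normal(size=200)

tree_a = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Rescale the first feature by 1000 (e.g. km -> m); the split thresholds
# scale accordingly, but the induced partitions are identical
X_scaled = X.copy()
X_scaled[:, 0] *= 1000
tree_b = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

print(np.allclose(tree_a.predict(X), tree_b.predict(X_scaled)))  # True
```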

In [66]:
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')

model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Time_of_Day',
    'Vehicle_Type',
    'Preparation_Time_min',
    'Courier_Experience_yrs'
])

X = model.fit_transform(Food)
y = Food['Delivery_Time_min']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)
In [67]:
tree_model = DecisionTreeRegressor(random_state=2)
tree_model.fit(X_train, y_train)
Out[67]:
DecisionTreeRegressor(random_state=2)
In [68]:
feature_importances = tree_model.feature_importances_
feature_names = tree_model.feature_names_in_

features_with_importances = list(zip(feature_names, feature_importances))

print("Regression Tree:")
print(f"Tree depth: {tree_model.get_depth()}")
print(f"Number of leaves: {tree_model.get_n_leaves()}")
print("Feature importances with names:")
for feature, importance in features_with_importances:
    print(f"{feature}: {importance}")
Regression Tree:
Tree depth: 15
Number of leaves: 391
Feature importances with names:
intercept: 0.0
Distance_km: 0.7395694133585553
Weather[Foggy]: 0.007631080736828054
Weather[Rainy]: 0.00344948034208744
Weather[Snowy]: 0.012462360581793706
Weather[Windy]: 0.005930637070555485
Traffic_Level[Low]: 0.024642504880278127
Traffic_Level[Medium]: 0.002713006906142687
Time_of_Day[Evening]: 0.005121012228573312
Time_of_Day[Morning]: 0.0031360716319012547
Time_of_Day[Night]: 0.00165402628289708
Vehicle_Type[Car]: 0.001991788688687997
Vehicle_Type[Scooter]: 0.005642427790877079
Preparation_Time_min: 0.163347385261935
Courier_Experience_yrs: 0.022708804238887673

The resulting tree is very complex, with 15 levels and 391 leaves, suggesting strong overfitting.
Regarding variable importance information, the results are consistent with those observed in the previous models.

In [69]:
plt.figure(figsize=(50, 50)) 
plot_tree(tree_model, 
          feature_names=X.columns, 
          filled=True, 
          rounded=True, 
          proportion=True,  
          fontsize=12,  
          precision=2)  

plt.title('Regression Tree', fontsize=16) 
plt.show()
[Figure: Regression Tree]

The plot confirms the enormous complexity of the tree.

In [70]:
y_pred = tree_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.4f}")
print(f"R2: {r2:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {np.sqrt(mse):.4f}")
MAE: 8.5419
R2: 0.6977
MSE: 124.6488
RMSE: 11.1646

The residual analysis shows worse results compared to the previous models, although worse performance was expected given the enormous size of the tree.
We proceed to simplify the tree by studying the maximum depth.

In [71]:
max_depth_values = range(1, 100) 
cv_scores = []

for depth in max_depth_values:
    tree_cv = DecisionTreeRegressor(criterion='squared_error', max_depth=depth, random_state=2)
    scores = cross_val_score(tree_cv, X_train, y_train, cv=10, scoring='r2')
    cv_scores.append(np.mean(scores))
In [72]:
plt.figure(figsize=(10, 6))
plt.plot(max_depth_values, cv_scores, 'o-')
plt.xlabel('Max Depth')
plt.ylabel('Cross-Validation R²')
plt.title('Cross-Validation Results for Tree Pruning')
plt.grid(True)
plt.xlim(0, 100)
plt.show()
[Figure: Cross-Validation Results for Tree Pruning]

The graph shows the effect of the decision tree depth on performance, using cross-validation and the R² score.
The results show an initial increase in R² followed by a decrease, indicating the optimization between underfitting and overfitting.
The optimal depth seems to be 4, as this is where R² is maximum.

In [73]:
best_depth = max_depth_values[np.argmax(cv_scores)]
print(f"Best depth from cross-validation: {best_depth}")
Best depth from cross-validation: 4

The optimal depth of 4 is confirmed, so we proceed with pruning.
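Limiting the depth is one form of pruning; scikit-learn also supports cost-complexity pruning via the `ccp_alpha` parameter, which collapses subtrees whose impurity reduction does not justify their size. A minimal sketch of this alternative on synthetic data (the choice of a mid-path alpha here is purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + rng.normal(size=300)

full = DecisionTreeRegressor(random_state=0).fit(X, y)

# cost_complexity_pruning_path returns the effective alphas at which
# subtrees collapse; a larger ccp_alpha prunes more aggressively
path = full.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # mid-path alpha

pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())  # pruned tree has fewer leaves
```

In practice, `ccp_alpha` would be selected by cross-validation, just like `max_depth` above.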

In [74]:
pruned_tree = DecisionTreeRegressor(max_depth=best_depth, random_state=2)
pruned_tree.fit(X_train, y_train)
Out[74]:
DecisionTreeRegressor(max_depth=4, random_state=2)
In [75]:
plt.figure(figsize=(50, 50))
plot_tree(pruned_tree, 
            feature_names=X.columns, 
            filled=True, 
            rounded=True,
            proportion=True,  
            fontsize=13,  
            precision=2)
plt.title(f'Pruned Regression Tree (Max Depth = {best_depth})')
plt.show()
[Figure: Pruned Regression Tree (Max Depth = 4)]

The pruned tree is much simpler and easier to explain.
The only variables considered in the end are Distance_km and Preparation_Time_min.
Darker orange shades indicate more extreme predicted values (higher or lower), while lighter or white nodes indicate values close to the mean of the target.
These colors do not represent categories, but a continuous gradation of predicted values.

In [76]:
y_pred_pruned = pruned_tree.predict(X_test)

mae = mean_absolute_error(y_test, y_pred_pruned)
mse = mean_squared_error(y_test, y_pred_pruned)
r2 = r2_score(y_test, y_pred_pruned)

print(f"MAE: {mae:.4f}")
print(f"R2: {r2:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {np.sqrt(mse):.4f}")
MAE: 7.5838
R2: 0.7749
MSE: 92.8263
RMSE: 9.6346

With pruning, the results have also improved.

  • Bagging¶

We now try using ensemble methods for trees, starting with Bagging.
Bagging builds multiple trees on different random subsets of the data and averages their predictions.
This technique reduces variance and helps to improve model stability and accuracy.
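scikit-learn also provides a dedicated `BaggingRegressor`, whose default base estimator is a full decision tree; it is equivalent to the construction used below, where a Random Forest is given access to all features. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 trees is fit on a bootstrap sample of the training data,
# and the ensemble prediction is the average over trees
bagg = BaggingRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
score = bagg.score(X_te, y_te)
print(f"Test R^2: {score:.3f}")
```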

In [77]:
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
Food[categorical] = Food[categorical].astype('category')

model = MS([
    'Distance_km',
    'Weather',
    'Traffic_Level',
    'Time_of_Day',
    'Vehicle_Type',
    'Preparation_Time_min',
    'Courier_Experience_yrs'
])

X = model.fit_transform(Food)
y = Food['Delivery_Time_min']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

For Bagging, we use a Random Forest model with max features set equal to the total number of features.
This ensures that each tree is built using all available variables, maintaining the standard Bagging approach where randomness comes only from the data sampling, not from feature selection.

In [78]:
bagg_model = RandomForestRegressor(
    n_estimators=500, 
    max_features=X.shape[1], 
    random_state=1,
    bootstrap=True  
)
bagg_model.fit(X_train, y_train)
Out[78]:
RandomForestRegressor(max_features=15, n_estimators=500, random_state=1)
In [79]:
error_rate = []
X_test_array = X_test.to_numpy()

for i in range(1, 501, 10):
    y_pred_bagg = np.mean([t.predict(X_test_array) for t in bagg_model.estimators_[:i]], axis=0)
    error_rate.append(mean_squared_error(y_test, y_pred_bagg))

plt.figure(figsize=(10, 6))
plt.plot(range(1, 501, 10), error_rate, 'r-', lw=2, label='Bagging')
plt.xlabel('Number of Trees')
plt.ylabel('Test MSE')
plt.title('Test MSE vs Number of Trees (Bagging)')
plt.legend()
plt.grid(True)
plt.show()
[Figure: Test MSE vs Number of Trees (Bagging)]

In this graph, we see that as the number of trees increases, the MSE decreases.
In particular, it stabilizes from around 100 trees onward.

In [80]:
y_pred_bagg = bagg_model.predict(X_test)
mse_bagg = mean_squared_error(y_test, y_pred_bagg)
print(f"Bagging MSE: {mse_bagg:.4f}")
print(f"Bagging RMSE: {np.sqrt(mse_bagg):.4f}")
Bagging MSE: 65.7354
Bagging RMSE: 8.1077

The error metrics show better results compared to the previous tree-based models.

In [81]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_bagg)
plt.plot([0, 100], [0, 100], 'k--')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Bagging: Actual vs Predicted')
plt.grid(True)
plt.show()
[Figure: Bagging: Actual vs Predicted]

If the points lie close to the bisector, the model is predicting well.
The Bagging model is reasonably accurate: it performs better than a single tree, though still worse than the linear models.
Some outliers are present.

In [82]:
importance_bagg = pd.Series(bagg_model.feature_importances_, index=X.columns)
importance_bagg.sort_values(ascending=False, inplace=True)
plt.figure(figsize=(10, 6))
importance_bagg.plot(kind='bar')
plt.title('Feature Importance (Bagging)')
plt.tight_layout()
plt.show()
[Figure: Feature Importance (Bagging)]

In this other plot, we see that the variable Distance_km dominates, followed by Preparation_Time_min, although with less than half the importance of the first.
These results were expected considering the node values of the pruned tree built earlier.

  • Random forest¶

Using Random Forest, which is an extension of Bagging, we introduce an additional layer of randomness by selecting a random subset of features at each split, rather than considering all features.
This helps to reduce correlation between trees, improving the model's generalization and reducing overfitting compared to Bagging, where all features are used in each tree split.

In [83]:
forest_model = RandomForestRegressor(
    n_estimators=100, 
    max_features='sqrt',  
    random_state=1
)
forest_model.fit(X_train, y_train)
Out[83]:
RandomForestRegressor(max_features='sqrt', random_state=1)
In [84]:
error_rate_rf = []
X_test_array = X_test.to_numpy() 

# Note: forest_model was fit with n_estimators=100, so slices beyond 100
# reuse all 100 trees and the error curve is flat from that point onward
for i in range(1, 501, 10):
    y_pred_rf = np.mean([t.predict(X_test_array) for t in forest_model.estimators_[:i]], axis=0)
    error_rate_rf.append(mean_squared_error(y_test, y_pred_rf))
In [85]:
plt.figure(figsize=(10, 6))
plt.plot(range(1, 501, 10), error_rate_rf, 'g-', label='Random Forest')
plt.plot(range(1, 501, 10), error_rate, 'r-', lw=2, label='Bagging')
plt.xlabel('Number of Trees')
plt.ylabel('Test MSE')
plt.title('Test MSE vs Number of Trees')
plt.legend()
plt.grid(True)
plt.show()
[Figure: Test MSE vs Number of Trees]

In our case, as seen from the plot, Random Forest performs worse than Bagging.
This could be due to the additional feature randomness in Random Forest, which may lead to less effective splits compared to Bagging, where all features are considered at each split.

In [86]:
y_pred_rf = forest_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest MSE: {mse_rf:.4f}")
print(f"Random Forest RMSE: {np.sqrt(mse_rf):.4f}")
Random Forest MSE: 84.8856
Random Forest RMSE: 9.2133

The error metrics confirm this worse performance.

In [87]:
importance_rf = pd.Series(forest_model.feature_importances_, index=X.columns)
importance_rf.sort_values(ascending=False, inplace=True)
plt.figure(figsize=(10, 6))
importance_rf.plot(kind='bar')
plt.title('Feature Importance (Random Forest)')
plt.tight_layout()
plt.show()
[Figure: Feature Importance (Random Forest)]

With Random Forest, we force the other variables to play a more significant role by randomly selecting subsets of features for each tree split.
This randomness encourages the model to explore different feature combinations, ensuring that no single feature dominates the decision-making process, as opposed to Bagging, where all features are used in each split.

  • Boosting¶

As the last tree-based method, we use Boosting.
Boosting is an ensemble technique that builds multiple trees sequentially, where each tree corrects the errors of the previous one.
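This sequential behaviour can be observed with `staged_predict`, which yields the ensemble's predictions after each added tree; test error falls as trees accumulate. A minimal sketch on synthetic data (variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                  learning_rate=0.1, random_state=0)
boost.fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., 200 trees;
# each new tree fits the residual errors of the current ensemble
test_mse = [mean_squared_error(y_te, pred) for pred in boost.staged_predict(X_te)]
print(f"MSE after 1 tree: {test_mse[0]:.2f}, after 200 trees: {test_mse[-1]:.2f}")
```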

In [88]:
boost_model = GradientBoostingRegressor(
    n_estimators=5000,
    max_depth=4,
    learning_rate=0.001,
    loss='squared_error',  
    random_state=1
)
boost_model.fit(X_train, y_train)
Out[88]:
GradientBoostingRegressor(learning_rate=0.001, max_depth=4, n_estimators=5000,
                          random_state=1)
In [89]:
y_pred_boost = boost_model.predict(X_test)
mse_boost = mean_squared_error(y_test, y_pred_boost)
print(f"Boosting MSE: {mse_boost:.4f}")
print(f"Boosting RMSE: {np.sqrt(mse_boost):.4f}")

methods = ['Bagging', 'Random Forest', 'Boosting']
mse_values = [mse_bagg, mse_rf, mse_boost]
Boosting MSE: 60.2433
Boosting RMSE: 7.7617

Based on the results, it appears that Boosting is the most effective tree-based method.
However, it still underperforms compared to the initial linear regression models, which remain the most accurate for this dataset.

In [90]:
plt.figure(figsize=(10, 6))
plt.bar(methods, mse_values)
plt.ylabel('Test MSE')
plt.title('Comparison of Tree-Based Methods')
for i, v in enumerate(mse_values):
    plt.text(i, v + 0.1, f"{v:.4f}", ha='center')
plt.tight_layout()
plt.show()
[Figure: Comparison of Tree-Based Methods]

The graph compares the performance of the three ensemble methods: Bagging, Random Forest, and Boosting. As shown, Boosting achieves the lowest test MSE.